I want to share a take on grammar‑constrained generation I’ve been working on for a while. The punchline is:
You can compute “which LLM tokens are valid next?” by reading the parser stack through a precomputed weighted automaton—instead of simulating the parser for every candidate token.
At runtime, mask generation becomes a tight loop of table lookups and bitset AND/OR operations, not "run the lexer+parser on a huge trie of token bytes every step."
This post focuses on the runtime idea and the core compilation decomposition (token→terminal mapping, per‑terminal stack behavior, composition). The gnarly “how to actually build these automata without blowing up determinization/memory” details are real, and I’ll likely write those up separately.
You have a grammar (often “JSON Schema”, but let’s just say some CFG-ish constraint) and you want the LLM to produce syntactically valid output.
At each decoding step, the model produces logits over a fixed vocabulary (often ~100k–200k tokens). You sample one token, append it, repeat.
Constrained decoding means: don’t sample tokens that would make the output impossible to complete into a valid string.
Formally, if `p` is the already‑generated token prefix, token `t` is valid if `p·t` can still be extended to a full grammar string:

`∃s : p·t·s ∈ L(G)`

So the mask we want is:

`mask(p) = { t ∈ V : ∃s, p·t·s ∈ L(G) }`
This “exists a completion” definition is the right one for decoding: you don’t just want “does the prefix parse so far?”; you want “did I dead‑end the grammar?”.
Think of the constraint system as two functions:
- `get_mask(state) -> bitset(V)`
- `commit(state, token) -> state`

At runtime, decoding looks like:
repeat:
(GPU) compute logits
(CPU) compute mask
apply mask to logits
sample token
commit token to both:
- the model (KV cache / etc.)
- the constraint state
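The two‑function interface above can be exercised end to end with a toy constraint standing in for a real grammar. Everything here (the class, the 4‑token vocabulary, the "digits then one dot" rule) is invented for illustration:

```python
# Toy constraint: "only digits, and at most one '.'", over a 4-token vocab.
# Masks are plain Python ints used as bitsets (bit i => token i allowed).
V = ["0", "1", ".", "a"]  # illustrative vocabulary

class DigitsDotConstraint:
    def __init__(self):
        self.seen_dot = False

    def get_mask(self):
        mask = 0
        for i, tok in enumerate(V):
            if tok.isdigit() or (tok == "." and not self.seen_dot):
                mask |= 1 << i
        return mask

    def commit(self, i):
        if V[i] == ".":
            self.seen_dot = True

state = DigitsDotConstraint()
out = []
for step in [1, 0, 2, 1]:  # pretend these are the sampled token ids
    mask = state.get_mask()
    assert mask & (1 << step), "sampled token must be masked-in"
    state.commit(step)
    out.append(V[step])
print("".join(out))  # prints "10.1"
```

The real system replaces `get_mask` with the automaton walk described below and `commit` with an incremental lexer+parser update, but the call pattern in the decode loop is exactly this.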
You can overlap GPU and CPU work, but the mask is still on the critical path: if it’s late, the GPU waits.
Inference is batched. A single “slow” request can introduce bubbles or stall a batch scheduler.
So the question isn’t:
“Can we compute masks fast on average?”
It’s closer to:
“Can we compute masks under a fixed budget every single step?”
If you’re running high‑throughput decoding, your per‑token time budget can easily be in the sub‑millisecond to low‑millisecond range per stream. Even if your GPU forward pass is longer than 1ms, mask generation still needs to be stable: schedulers and batching strategies hate unpredictable tail latency.
Mask generation is also frequently branchy and cache‑unfriendly in naïve designs—exactly the kind of workload that looks fine at p50 and awful at p99.9.
The commit(token) side is “incremental parsing”: you feed the newly generated bytes to a lexer+parser and update the parse configuration.
The parsing model I’m assuming here is a GLR parser driven by LR parse tables, with a graph‑structured stack (GSS).
The important pieces:
- the ACTION table, indexed as `ACTION[top_state, lookahead_terminal]`, which decides shift vs. reduce;
- the GOTO table, which gives the state pushed after a reduction.

So the constraint state is something like: the current lexer state plus the GSS of LR states.
If you’ve seen tree‑sitter, Lezer, or classic GLR literature, this is that world.
The lexer turns raw bytes into grammar terminals (`PLUS`, `IDENT`, `STRING`, etc.). Given the current constraint state, we want a mask over LLM tokens.
A naïve statement is:
“A token is allowed if appending it keeps the parse valid.”
But the parser doesn’t consume LLM tokens. It consumes grammar terminals.
And the mapping is messy:
(e.g., the LLM token `"true,"` might lex to `TRUE` then `COMMA`). So “check token `t`” actually means: run `t`’s bytes through the lexer from the current lexer state, enumerate the terminal sequences they could produce, and simulate the parser on each.
Doing that for a single token is manageable.
Doing that for 200k tokens every step is not.
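To make the token/terminal mismatch concrete, here is a toy maximal‑munch lexer splitting one LLM token’s bytes into several grammar terminals. The terminal set and patterns are invented for illustration:

```python
import re

# Hypothetical terminal set; longest-match is implicit in the patterns here.
TERMINALS = [("TRUE", r"true"), ("COMMA", r","), ("NUMBER", r"\d+")]

def lex(s):
    """Greedily split a byte string into grammar terminals."""
    out, i = [], 0
    while i < len(s):
        for name, pat in TERMINALS:
            m = re.match(pat, s[i:])
            if m:
                out.append(name)
                i += m.end()
                break
        else:
            raise ValueError(f"no terminal matches at {s[i:]!r}")
    return out

print(lex("true,"))    # prints ['TRUE', 'COMMA']
print(lex("true,12"))  # prints ['TRUE', 'COMMA', 'NUMBER']
```

One vocabulary entry, two (or three) terminals: this is why “validate token `t`” is not a single table lookup in the naïve formulation.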
A common direction is: build a trie over all vocabulary token byte sequences, then each step walk the trie once, running the lexer+parser along its edges and collecting which leaves survive.
This avoids a literal “for each token” loop and shares prefixes.
It can work decently for many grammars.
But worst‑case behavior is still brutal, for a few reasons:
Modern vocabularies contain weird long tokens: repeated punctuation, long runs of spaces, long runs of digits, etc. (If you’ve inspected cl100k_base/o200k_base, you’ve seen these.)
If a token is, say, 80–120 bytes long and those bytes are syntactically valid in the current context, then a trie traversal must still walk deep to confirm it.
This is already annoying even for deterministic parsing.
In LR parsing, a single terminal can trigger a chain of reductions before it is finally shifted.
Those reductions pop stack frames and consult GOTOs, and in GLR they can create fan‑out.
Even if average behavior is fine, the tail can be dominated by long reduction chains, GSS fan‑out, and deep trie walks for long tokens.
If your mask algorithm is “simulate parsing along many trie edges”, you’re multiplying this work.
In p99.9 land you want inner loops that look like: table lookups, sequential memory access, a fixed handful of bitwise ops.
Trie traversal + GLR is the opposite: pointer chasing, branching, irregular memory access.
I tried hard to engineer this style of approach into something stable. In my experience it’s a dead end if you care about strict, predictable latency.
Token validity depends on the current parse configuration—especially what’s on the LR stack.
So instead of asking:
“For each token `t`, does it work on this stack?”
ask:
“Given this stack, which tokens work?”
That suggests a precomputed function, roughly `mask(stack) -> bitset(V)`, which we can evaluate fast at runtime.
The key enabling fact from LR theory is:
The set of viable prefixes (strings that can appear on the parser stack during a valid parse) forms a regular language (Knuth).
Informally: although the grammar language is context‑free, the space of possible stacks is regular, recognized by a finite automaton (the LR(0) automaton).
So there exists a finite automaton that can read (a suffix of) the stack and decide things about future validity.
We’ll use that to compute a mask, not just accept/reject.
Here’s the core data structure: a finite automaton whose input symbols are LR stack states and whose transition weights are bitsets over the LLM vocabulary.
Think “finite automaton with bitset weights.”
We use bitsets with:
- AND (intersection) along a path, and
- OR (union) where alternative paths merge.

This fits the intuition: reading more of the stack can only rule tokens out (AND), while alternative parse configurations can only rule tokens back in (OR).
At mask time, we start at the top of the stack, step the automaton over stack symbols while AND‑ing edge weights, and OR the results across alternatives.
This is the “read‑the‑stack” idea.
If the automaton is deterministic, the conceptual runtime is almost embarrassingly simple:
def get_mask_single_stack(dwa, lexer_state, stack_top_to_bottom):
q = dwa.start_state_for_lexer(lexer_state)
mask = ALL_TOKENS_BITSET
for sid in stack_top_to_bottom:
q, w = dwa.step(q, sid) # table lookup
mask &= w # bitset AND
if dwa.can_stop_here(q, mask):
break
mask &= dwa.final_weight(q)
return mask
In practice you may still track a small frontier (e.g., multiple lexer states, or running over a GSS), but the inner operations stay “lookup + AND/OR”.
With GLR you don’t have one stack; you have a DAG of stacks (the GSS). Each path from a top node down to the root corresponds to one possible LR stack.
For get_mask, you want:
a token is valid if it is valid on any surviving parse stack.
So we run the same propagation, but over a graph:
The shape is the same: AND along edges, OR at merges.
You can implement it as a dynamic program / worklist over (gss_node, automaton_state) pairs. The key is: it’s still just table lookups and bitset ops; no “simulate parser transitions for every vocab token”.
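A minimal sketch of that worklist, under heavy simplifying assumptions: the GSS nodes are plain tuples, the automaton is collapsed to a single state, and the per‑stack‑symbol weights are made‑up bitsets rather than anything compiled from a grammar:

```python
from collections import deque

# Toy GSS: node = (lr_state_id, parents). Two stacks sharing a suffix:
#   path A: 3 -> 1 -> 0,   path B: 4 -> 1 -> 0
root = (0, [])
mid  = (1, [root])
topA = (3, [mid])
topB = (4, [mid])

# Made-up per-symbol weights over a 4-token vocabulary, plus a final weight.
WEIGHT = {0: 0b1111, 1: 0b0111, 3: 0b0011, 4: 0b0101}
FINAL = 0b1111

def get_mask_gss(tops):
    result = 0
    work = deque((t, 0b1111) for t in tops)  # (node, accumulated AND)
    while work:
        (sid, parents), acc = work.popleft()
        acc &= WEIGHT[sid]            # AND the edge weight along this path
        if parents:
            for p in parents:         # fan out down the shared stack
                work.append((p, acc))
        else:
            result |= acc & FINAL     # OR complete paths together
    return result

print(bin(get_mask_gss([topA, topB])))  # prints 0b111
```

Path A contributes `0b0011`, path B contributes `0b0101`, and the union is `0b0111`: a token valid on *any* surviving stack stays in the mask, exactly the GLR semantics described above.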
Two different (but compatible) reasons we can stop early: the undecided part of the mask may become empty (every token’s fate is already settled), or the automaton may reach a state from which deeper stack symbols cannot change the weights.
A useful way to engineer this is to make the automaton explicitly track “already accepted tokens” versus “still undecided tokens,” so you can stop when the undecided set becomes empty. Conceptually:
This is one of those optimizations that’s obvious in hindsight: you want to avoid doing work on bits whose fate is already fixed.
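A toy version of the accepted/undecided split, with made‑up bitsets in place of compiled weights. For each stack symbol we assume two bitsets: `fin`, tokens whose validity is settled at this depth, and `w`, tokens that remain plausible if we keep reading deeper:

```python
def mask_with_early_stop(stack_weights):
    """Walk stack symbols top-down; stop once no token's fate is open."""
    accepted = 0
    undecided = 0b1111  # 4-token toy vocabulary, all undecided initially
    steps = 0
    for w, fin in stack_weights:
        steps += 1
        accepted |= undecided & fin   # fate fixed here: accept these now
        undecided &= w & ~fin         # these still depend on deeper symbols
        if undecided == 0:
            break                     # deeper symbols can't change anything
    return accepted, steps

mask, steps = mask_with_early_stop([
    (0b1110, 0b0010),  # symbol 1: token 1 settled, token 0 ruled out
    (0b1100, 0b1100),  # symbol 2: the rest settled
    (0b1111, 0b0000),  # never reached
])
print(bin(mask), steps)  # prints "0b1110 2"
```

The payoff is the `break`: with well‑shaped weights, the loop usually terminates after a short, bounded stack suffix rather than walking to the root.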
So far I’ve described the runtime object as if it just exists.
The compilation story has three pieces: the token→terminal mapping (the Terminal DWA), per‑terminal stack behavior, and the composition of the two into the Parser DWA.
I’ll sketch each. This is the minimum detail needed to make the idea “real” without diving into all the compile-time engineering.
A grammar is defined over terminals. A language model produces LLM tokens.
These symbol sets don’t line up.
So we precompute, for each lexer/tokenizer state, which vocabulary tokens can produce which terminal sequences.
The lexer states here have to track in‑progress maximal munch (“I’ve seen `+` but it could become `++` if I see another `+`”). Build a trie over all vocabulary token byte sequences. Then simulate the grammar lexer over the trie.
Whenever the lexer recognizes a terminal on some trie path, record: the starting lexer state, the terminal emitted, the resulting lexer state, and the set of vocabulary tokens whose byte paths pass through that point.
This gives you an automaton whose states are lexer configurations, whose edges are labeled with terminals, and whose weights are vocab‑token bitsets.
Then determinize it (subset construction) to merge equivalent lexer configurations.
The result is: given a starting lexer state, you can walk through “what terminals could this token emit?” while carrying along exactly which vocab tokens correspond to those terminals.
You can think of this as a reversed transducer: instead of mapping token → terminals, it maps terminal-sequence paths → sets of tokens that realize them.
Now assume a grammar terminal has been recognized.
What happens to the LR parser stack?
In LR parsing, on lookahead terminal `t`, the parser consults the ACTION table at the current top‑of‑stack state: shift `t`, reduce by some production (and consult again), accept, or reject.
So before you can shift `t`, you may need a chain of reductions. That chain depends on deeper stack elements because reductions pop.
Even though the LR stack can be arbitrarily deep, the behavior “what reductions happen before shifting `t`?” depends only on some suffix of the stack.
You can capture all possible reduction chains for terminal `t` as a finite graph that: follows reduce actions, pops the frames each reduction consumes, pushes the corresponding GOTO states, and terminates when `t` can finally be shifted.
This is basically a finite automaton with a “stack effect” register.
It’s convenient to represent stack operations as a word over symbols: `pop(q)` and `push(q)`, one pair of symbols per LR state `q`.
Concatenating two terminal effects concatenates these words.
Then you apply the obvious cancellation rule: `push(q)` immediately followed by `pop(q)` cancels to nothing, while `push(q)` followed by `pop(r)` with `r ≠ q` marks the path impossible.
Algebraically this is the polycyclic monoid; practically it’s “well-typed push/pop cancellation.”
Why do we care? Because LLM tokens often correspond to multiple terminals; composing per-terminal stack effects and canceling adjacent push/pop pairs massively simplifies what you need to inspect on the original stack.
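The cancellation step is mechanical enough to sketch directly. This toy `cancel` treats a stack effect as a list of `"pop:q"` / `"push:q"` strings read left to right, and returns `None` for the ill‑typed (zero) case; the encoding is an illustration, not the real representation:

```python
def cancel(word):
    """Cancel adjacent push/pop pairs in a stack-effect word."""
    out = []
    for sym in word:
        if out and out[-1].startswith("push:") and sym.startswith("pop:"):
            if out[-1][5:] == sym[4:]:
                out.pop()       # push q ; pop q  =>  identity
                continue
            return None         # push q ; pop r  =>  zero (impossible path)
        out.append(sym)
    return out

# Composing effect "pop 2, push 5" with "pop 5, push 9" nets "pop 2, push 9":
print(cancel(["pop:2", "push:5", "pop:5", "push:9"]))  # prints ['pop:2', 'push:9']
print(cancel(["push:5", "pop:6"]))                     # prints None
```

The first example is the point: after composing two terminals’ effects, the intermediate `push 5 / pop 5` disappears, so the composed effect inspects one fewer symbol of the original stack.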
For each terminal `t`, we build a small NFA‑like object whose accepting paths correspond to: the possible reduction chains that end in shifting `t`, each annotated with its net stack effect.
This automaton is derived directly from the LR parse table: edges correspond to “this state shifts `t`” or “this state reduces by production `p`” and so on.
Determinize it so that later composition is manageable.
Now we have: a Terminal DWA mapping LLM tokens to terminal sequences (with vocab bitsets as weights), and a per‑terminal automaton describing each terminal’s stack effect.
To get “stack → vocab mask”: substitute each terminal edge of the Terminal DWA with that terminal’s stack‑effect automaton, cancel adjacent push/pop pairs, and determinize.
This final structure is what I call the Parser DWA.
For get_mask, we only care whether a token is valid, not exactly what new stack would result.
The actual stack update is handled by commit(token) using the real lexer+parser.
So during compilation, after composing and canceling, any remaining “push” symbols that represent “the parser would push state X” can be discarded from the mask automaton. The runtime commit will do the true stack manipulation.
This is also what lets you do useful things like: start from an arbitrary prompt prefix (maybe partially through a JSON value) and commit raw bytes without needing to align to LLM token boundaries.
There are two complementary views here: bounded reduction chains, and regularity of viable prefixes.
A parser can, in principle, take arbitrarily long chains of reductions without consuming input if the grammar has certain recursive patterns (right recursion / hidden left recursion). There’s classic work (Aycock & Horspool) characterizing when reduction chains can be unbounded and how to transform grammars to avoid it.
After the usual normalizations, the number of reductions you need to consider before a shift becomes bounded by a constant (for that grammar). That implies: deciding token validity only requires looking at a bounded suffix of the stack.
Classical LR theory says viable prefixes form a regular language and are recognized by a finite automaton (the LR(0) automaton). Our Parser DWA is essentially a weighted extension of that automaton: instead of merely accepting/rejecting a stack configuration, it computes “which vocab tokens extend it.”
Regularity is the deep reason “finite automaton reading stack suffix” is even possible.
There’s one more annoying real-world detail: the grammar lexer is streaming, and “token boundaries” don’t align with LLM tokens.
Even with longest-match lexing (“maximal munch”), you can’t always commit a terminal the moment it matches, because it might still be extendable.
Example:
- The lexer has seen `"+"` so far. That matches `PLUS`.
- But if the next byte is another `"+"`, then the correct token is `INCREMENT` (`"++"`), not `PLUS` then `PLUS`.

In a traditional lexer, you solve this with lookahead on the raw input stream. In incremental decoding, the raw input arrives token‑by‑token, so you need a streaming strategy.
When a terminal matches at some point but is still extendable, fork the lexer/parser state: one branch commits the terminal now; the other keeps consuming bytes toward a longer match.
If later bytes allow a longer match, prune the premature branch.
This creates multiple active lexer/parser states—exactly what GLR + GSS machinery is good at representing compactly.
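Here is the `PLUS`/`INCREMENT` situation as runnable code, in a deliberately simplified form: because one byte of lookahead resolves this particular fork, the toy lexer can resolve it eagerly instead of keeping both branches alive the way the real GSS machinery does. The grammar (digits, `+`, `++`) is invented:

```python
def lex_stream(text):
    """Stream bytes one at a time, committing PLUS only when it can't extend."""
    terminals, pending = [], ""
    for ch in text:
        if pending == "+" and ch == "+":
            terminals.append("INCREMENT")  # longer match wins over PLUS PLUS
            pending = ""
        elif ch == "+":
            if pending:                    # flush pending digits as NUMBER
                terminals.append("NUMBER")
            pending = "+"                  # matches PLUS, but still extendable
        else:                              # a digit
            if pending == "+":
                terminals.append("PLUS")   # "+" could not extend: commit it
                pending = ""
            pending += ch
    if pending == "+":                     # end of input settles the fork
        terminals.append("PLUS")
    elif pending:
        terminals.append("NUMBER")
    return terminals

print(lex_stream("1++2"))  # prints ['NUMBER', 'INCREMENT', 'NUMBER']
print(lex_stream("1+2"))   # prints ['NUMBER', 'PLUS', 'NUMBER']
```

With real grammars the ambiguity can persist across many bytes, which is why the actual system forks configurations into the GSS and prunes later rather than relying on single‑byte lookahead.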
If you have multiple active parse configurations (due to lexer ambiguity, inhibited terminals, or LR conflicts), then:
an LLM token is allowed if it is allowed in any active configuration.
So get_mask unions the bitsets across branches.
That “union across alternatives” is exactly why the weighted automaton story uses union as the merge operator.
At runtime you do two very different kinds of work: a cheap `get_mask` and an honest `commit`.
`commit(token)` (parser work). This is “real parsing.” It can be complicated and it inherits the usual GLR theoretical caveats.
`get_mask(state)` (automaton work). This is table lookups plus bitset AND/ORs over a bounded stack suffix. That’s the crucial separation: mask generation never simulates reduction chains at runtime; that work has been compiled away into the automaton.
I’m not going to claim a formal optimality theorem here; parsing theory is too spiky for that.
But the division of labor feels “asymptotically right”: `commit` pays the real parsing cost once per sampled token, while `get_mask` does a bounded number of table lookups and bitset operations per step.
Also: you can’t avoid touching bits if the output is literally a |V|-bit mask. The goal is to make that bit touching occur in a small, fixed number of AND/ORs, not inside a giant vocabulary loop with parsing simulation.
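A quick back‑of‑envelope for the “unavoidable bits”: stored as 64‑bit words, a mask for a 200k‑token vocabulary (size chosen for illustration) is a few thousand words, so a full‑mask AND is one short, branch‑free loop:

```python
V = 200_000                 # illustrative vocabulary size
WORDS = (V + 63) // 64      # 64-bit words needed to hold one mask

def and_masks(a, b):
    """Intersect two masks word by word; no branches, no pointer chasing."""
    return [x & y for x, y in zip(a, b)]

full = [(1 << 64) - 1] * WORDS
none = [0] * WORDS
print(WORDS)  # prints 3125
```

A handful of these per decoding step is cheap and, more importantly, has the same cost every step.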
Everything above punts on the hardest engineering fact:
building these automata can blow up if you’re careless.
You’re doing determinization and minimization in a weighted / semiring-ish setting, and real grammars are large.
Most of my time went into making compilation time and memory sane.
That’s a whole separate writeup.
If you only remember one idea from this post, it’s this: you can answer “which LLM tokens are valid next?” by reading the parser stack through a precomputed weighted automaton, with nothing but table lookups and bitset AND/ORs at runtime.
The result is a get_mask whose runtime behavior is predictable enough to care about in p99.9 land, while commit remains a standard incremental GLR parser update.
If/when I write the follow-up, it’ll be about the compile-time tricks: building the Terminal DWA, building the per-terminal templates, composing/canceling deterministically, and keeping the result small.